Model

An extension of simple regression with multiple predictors to better aid in the estimation and prediction of the response. The goal is to determine the effects (if any) of each predictor, controlling for the others.

\[ Y_i=\beta_0+\beta_1 X_{1,i}+\cdots+\beta_{p-1} X_{p-1,i}+\epsilon_i \Leftrightarrow Y_i=\sum_{k=0}^{p}\beta_kX_{k,i}+\epsilon_i\quad X_{0,i}\equiv 1 \]

where \(\epsilon_i\stackrel{iid}{\sim} N(0,\sigma^2)\)

Polynomial term

Useful for accounting for potential curvature/nonlinearity in the relationship between predictors and the response.

Example:
Regression with 5 predictors but with only 2 unique predictors

\[ Y_i=\beta_0+\beta_1 X_{1,i}+\beta_2 X_{1,i}^2+\beta_3 X_{2,i}+\beta_4 X_{1,i}X_{2,i}+\beta_5 X_{1,i}^2X_{2,i} +\epsilon_i \]

Regression surface

1-unit increase

In a model with polynomial/interaction terms special care needs to be taken as an increase in a predictor causes other changes too.

For example \[ E(Y|X_1,X_2)=\beta_0+\beta_1X_{1}+\beta_2X_{2}+\beta_3\underbrace{X_{1}X_{2}}_{X_3} \]

where a 1-unit increase in \(X_2\), i.e. \(X_2+1\), leads to

\[ E(Y|X_1,X_2+1)=E(Y|X_1,X_2)+\beta_2+\beta_3X_{1} \]

The effect of increasing \(X_2\) by 1, is dependent on the level of \(X_1\).

Example (Flyash)

dat = matrix(c(0, 4779, 0, 4706, 0, 4350, 20, 5189, 20, 5140,
    20, 4976, 30, 5110, 30, 5685, 30, 5618, 40, 5995, 40, 5628,
    40, 5897, 50, 5746, 50, 5719, 50, 5782, 60, 4895, 60, 5030,
    60, 4648), 18, 2, byrow = TRUE, dimnames = list(c(), c("flyash",
    "strength")))

plot(dat)
dat = as.data.frame(dat)
reg.lin = lm(strength ~ flyash, data = dat)
summary(reg.lin)
## 
## Call:
## lm(formula = strength ~ flyash, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -901.62 -203.18   31.56  327.55  653.72 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4924.595    213.299  23.088 1.03e-13 ***
## flyash        10.417      5.507   1.891   0.0768 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 460.8 on 16 degrees of freedom
## Multiple R-squared:  0.1827, Adjusted R-squared:  0.1317 
## F-statistic: 3.578 on 1 and 16 DF,  p-value: 0.0768
abline(reg.lin)

# Add 2nd order polynomial
flyash2 = dat$flyash^2
reg.2poly = lm(strength ~ flyash + flyash2, data = dat)
summary(reg.2poly)
## 
## Call:
## lm(formula = strength ~ flyash + flyash2, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -477.70 -214.01   27.04  287.87  390.78 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4486.3611   174.7531  25.673 8.25e-14 ***
## flyash        63.0052    12.3725   5.092 0.000132 ***
## flyash2       -0.8765     0.1966  -4.458 0.000460 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 312.1 on 15 degrees of freedom
## Multiple R-squared:  0.6485, Adjusted R-squared:  0.6016 
## F-statistic: 13.84 on 2 and 15 DF,  p-value: 0.0003933
curve(reg.2poly$coefficients[1] + reg.2poly$coefficients[2] *
    x + reg.2poly$coefficients[3] * x^2, from = 0, to = 60, col = 2,
    add = TRUE)



legend(0, 6000, legend = c("1st order", "2nd order"), lty = 1,
    col = 1:2, bg = "light gray")

Example (Safety)

\[\begin{align*} Y &= \mbox{ lost work hours }\\ X_1 &= \mbox{ number of employees }\\ X_2 &= \begin{cases} 1 & \mbox{safety program used}\\ 0 & \mbox{no safety program used} \end{cases} \end{align*}\]

Include x1 and x2 in the model

\[ E(Y_i)=\beta_0+\beta_1 X_{1,i} + \beta_2 X_{2,i} \]

implies

\[ E(Y_i)=\begin{cases} (\beta_0+\beta_2)+\beta_1 X_{1,i} & \mbox{ if }X_2=1\\ \beta_0+\beta_1 X_{1,i} & \mbox{ if }X_2=0 \end{cases} \]

dat = read.csv("https://raw.githubusercontent.com/lingxiaozhou/STA4210Rmaterial/main/data/safe_reg.csv",
    header = TRUE)


reg1 = lm(y ~ x1 + x2, data = dat)
summary(reg1)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.923 -16.260  -2.347  13.562  67.300 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  31.39945    9.90247   3.171  0.00305 ** 
## x1            0.01421    0.00140  10.148 3.07e-12 ***
## x2          -54.21033    7.24299  -7.485 6.47e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.87 on 37 degrees of freedom
## Multiple R-squared:  0.8154, Adjusted R-squared:  0.8054 
## F-statistic: 81.73 on 2 and 37 DF,  p-value: 2.658e-14
coef1 = coef(reg1)

plot(dat$x1, dat$y, pch = dat$x2 + 1, col = dat$x2 + 1)
abline(coef1["(Intercept)"], coef1["x1"], col = 1)
abline(coef1["(Intercept)"] + coef1["x2"], coef1["x1"], col = 2)
legend("bottomright", levels(x2), pch = 1:2, col = 1:2, title = "x2",
    legend = c(0, 1))

Include the interaction x1*x2 in the model

\[ E(Y_i)=\beta_0+\beta_1 X_{1,i} + \beta_2 X_{2,i} +\beta_3 (X_1 X_2)_i \]

which implies

\[ E(Y_i)=\begin{cases} (\beta_0+\beta_2)+(\beta_1+\beta_3) X_{1,i} & \mbox{ if }X_2=1\\ \beta_0+\beta_1 X_{1,i} & \mbox{ if }X_2=0 \end{cases} \]

reg2 = lm(y ~ x1 + x2 + x1:x2, data = dat)  #same as lm(y~x1+x2+x1*x2) or lm(y~x1*x2)
summary(reg2)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x1:x2, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.407 -10.328  -2.760   8.563  65.534 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.844082  10.127410  -0.182    0.857    
## x1           0.019749   0.001546  12.777 6.11e-15 ***
## x2          10.725385  14.054508   0.763    0.450    
## x1:x2       -0.010957   0.002174  -5.041 1.32e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.75 on 36 degrees of freedom
## Multiple R-squared:  0.8918, Adjusted R-squared:  0.8828 
## F-statistic:  98.9 on 3 and 36 DF,  p-value: < 2.2e-16
coef2 = coef(reg2)

plot(dat$x1, dat$y, pch = dat$x2 + 1, col = dat$x2 + 1)
abline(coef2["(Intercept)"], coef2["x1"], col = 1)
abline(coef2["(Intercept)"] + coef2["x2"], coef2["x1"] + coef2["x1:x2"],
    col = 2)
legend("bottomright", levels(x2), pch = 1:2, col = 1:2, title = "x2",
    legend = c(0, 1))